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COVARIANCE REGULARIZATION BY THRESHOLDING 

By Peter J. Bickel^ and Elizaveta Levina^ 

University of California, Berkeley and University of Michigan 

This paper considers regularizing a covariance matrix of p vari- 
ables estimated from n observations, by hard thresholding. We show 
that the thresholded estimate is consistent in the operator norm as 
long as the true covariance matrix is sparse in a suitable sense, the 
variables are Gaussian or sub-Gaussian, and (logp)/n— >0, and ob- 
tain explicit rates. The results are uniform over families of covariance 
matrices which satisfy a fairly natural notion of sparsity. We discuss 
an intuitive resampling scheme for threshold selection and prove a 
general cross-validation result that justifies this approach. We also 
compare thresholding to other covariance estimators in simulations 
and on an example from climate data. 

1. Introduction. Estimation of covariance matrices is important in a 
number of areas of statistical analysis, including dimension reduction by 
principal component analysis (PCA), classification by linear or quadratic 
discriminant analysis (LDA and QDA), establishing independence and con- 
ditional independence relations in the context of graphical models, and set- 
ting confidence intervals on linear functions of the means of the components. 
In recent years, many application areas where these tools are used have been 
dealing with very high-dimensional datasets, and sample sizes can be very 
small relative to dimension. Examples include genetic data, brain imaging, 
spectroscopic imaging, climate data and many others. 

It is well known by now that the empirical covariance matrix for samples 
of size n from a p-variate Gaussian distribution, 7V^(/i, Sp), is not a good 
estimator of the population covariance if p is large. Many results in random 
matrix theory illustrate this, from the classical Marcenko-Pastur law [29] 
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to the more recent work of Johnstone and his students on the theory of the 
largest eigenvalues [12, 23, 30] and associated eigenvectors [24]. However, 
with the exception of a method for estimating the covariance spectrum [11], 
these probabilistic results do not offer alternatives to the sample covariance 
matrix. 

Alternative estimators for large covariance matrices have therefore at- 
tracted a lot of attention recently. Two broad classes of covariance estima- 
tors have emerged: those that rely on a natural ordering among variables, 
and assume that variables far apart in the ordering are only weakly corre- 
lated, and those invariant to variable permutations. The first class includes 
regularizing the covariance matrix by banding or tapering [2, 3, 17], which 
we will discuss below. It also includes estimators based on regularizing the 
Cholesky factor of the inverse covariance matrix. These methods use the 
fact that the entries of the Cholesky factor have a regression interpreta- 
tion, which allows application of regression regularization tools such as the 
lasso and ridge penalties [21], or the nested lasso penalty [28] specifically de- 
signed for the ordered variables situation. Banding the Cholesky factor has 
also been proposed [3, 34]. These estimators are appropriate for a number 
of applications with ordered data (time series, spectroscopy, climate data). 
For climate applications and other spatial data, since there is no total order- 
ing on the plane, applying the Cholesky factor methodology is problematic; 
but as long as there is an appropriate metric on variable indexes (some- 
times, simple geographical distance can be used), banding or tapering the 
covariance matrix can be applied. 

However, there are many applications, for example, gene expression ar- 
rays, where there is no notion of distance between variables at all. These ap- 
plications require estimators invariant under variable permutations. Shrink- 
age estimators are in this category and have been proposed early on [7, 20]. 
More recently, Ledoit and Wolf [26] proposed an estimator where the op- 
timal amount of shrinkage is estimated from data. Shrinkage estimators 
shrink the overdispersed sample covariance eigenvalues, but they do not 
change the eigenvectors, which are also inconsistent [24], and do not result 
in sparse estimators. Several recent papers [5, 31, 35] construct a sparse 
permutation-invariant estimate of the inverse of the covariance matrix, also 
known as the concentration or precision matrix. Sparse concentration matri- 
ces are of interest in graphical models, since zero partial correlations imply 
a graph structure. The common approach of [5, 31, 35] is to add an Li 
(lasso) penalty on the entries of the concentration matrix to the normal 
likelihood, which results in shrinking some of the elements of the inverse 
to zero. In [31], it was shown that this method has a rate of convergence 
that is driven by (logp)/n and the sparsity of the truth. Computing this 
estimator is nontrivial for high dimensions and can be achieved either via 
a semidefinite programming algorithm [[5], [35]] or by using the Cholesky 
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decomposition to reparametrize the concentration matrix [31], but all of 
these are computationally intensive. A faster algorithm that employs the 
lasso was proposed by Friedman, Hastie and Tibshirani [16] . This approach 
has also been extended to more general penalties like SCAD [15] by Lam 
and Fan [25] and Fan, Fan and Lv [14]. In specific applications, there have 
been other permutation-invariant approaches that use different notions of 
sparsity: Zou, Hastie and Tibshirani [36] apply the lasso penalty to loadings 
in PCA to achieve sparse representation; d'Aspremont et al. [6] compute 
sparse principal components by semidefinite programming; Johnstone and 
Lu [24] regularize PCA by moving to a sparse basis and thresholding; and 
Fan, Fan and Lv [13] impose sparsity on the covariance via a factor model, 
which is often appropriate in finance applications. 

In this paper, we propose thresholding of the sample covariance matrix 
as a simple and permutation-invariant method of covariance regulariza- 
tion. This idea has been simultaneously and independently developed by 
El Karoui [10], who studied it under a special notion of sparsity called [3- 
sparsity (see details in Section 2.4). Here we develop a natural permutation- 
invariant notion of sparsity which, though more specialized than El Karoui's, 
seems easier to analyze and parallels the treatment in [3] which defines a 
class of models where banding is appropriate. Bickel and Levina [3] showed 
that, uniformly over the class of approximately "bandable" matrices, the 
banded estimator is consistent in the operator norm (also known as the ma- 
trix 2-norm, or spectral norm) for Gaussian data as long as (logp)/n — > 0. 

Here we show consistency of the thresholded estimator in the operator 
norm, uniformly over the class of matrices that satisfy our notion of spar- 
sity, as long as (logp)/n 0, and obtain explicit rates of convergence. There 
are various arguments to show that convergence in the operator norm implies 
convergence of eigenvalues and eigenvectors [3, 10], so this norm is particu- 
larly appropriate for PCA applications. The rate we obtain is slightly worse 
than the rate of banding when the variables are ordered, but the difference is 
not sharp. This is expected, since in the situation when variables are ordered, 
banding takes advantage of the underlying true structure. Thresholding, on 
the other hand, is applicable to many more situations. In fact, our treat- 
ment is in many respects similar to the pioneering work on thresholding of 
Donoho and Johnstone [8] and the recent work of Johnstone and Silverman 
[22] and Abramovich et al. [1]. 

The rest of this paper is organized as follows. In Section 2 we introduce 
the thresholding estimator and our notion of sparsity, prove the convergence 
result and compare to results of El Karoui (Section 2.4) and to banding (Sec- 
tion 2.5). In Section 3, we discuss a cross-validation approach to threshold 
selection, which is novel in this context, and prove a cross-validation result of 
general interest. Section 4 gives simulations comparing several permutation- 
invariant estimators and banding. Section 5 gives an example of thresholding 
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estimator applied to climate data and Section 6 gives a brief discussion. The 
Appendix contains more technical proofs. 

2. Asymptotic results for thresholding. We start by setting up notation. 
We write Amax(Af) = Ai(Af) > • • • > Ap(Af) = Amin(Af) for the eigenvalues of 
a matrix M. Following the notation of [3], we define, for any < r,s < oo 
and a. p X p matrix M, 

(1) = sup{||Mx||, : ||x||^ = 1}, 

where ||x||^ = J2^=i\^j\^ ■ particular, we write ||M|| = |1M||(2,2) for the 
operator norm, which for a symmetric matrix is given by 

||M|| = max |A,-(M)|. 

i<i<p 

For symmetric matrices, we have (see, e.g., [18]) 

(2) ||M|| < (||M||(i,i)||M||(^,^))^/2 = ||M||(i,i) =max^ \m,j\. 

We also use the Frobenius matrix norm, 

\\Mfp = ^mjj = tv{MM'^). 

We define the thresholding operator by 

(3) Ts{M) = [mijl{\mij\>s)], 



which we refer to as M thresholded at s. Note that preserves symmetry 
and is invariant under permutations of variable labels, but does not neces- 
sarily preserve positive definiteness. However, if 

(4) ||r,-ro||<e and Ami„(A/)>e, 

then Ta{M) is necessarily positive definite, since for all vectors v with ||f ||2 = 
1 we have v'^T^Mv > v^Mv -e> X„nniM) -e>0. 

2.1. A uniformity class of covariance matrices. Recall that the banding 
operator was defined in [3] as Bk{M) = [mijl{\i — j\ < k)]. The uniformity 
class of "approximately bandable" covariance matrices is defined by 

U{eo,a,C) = < S:max^{|o-ij| : \i- j\ > k} < Ck~" for all A; > 0, 

and < £0 < Km{^) < Amax(S) < 1/eo }■ 
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Here we define the uniformity class of covariance matrices invariant under 
permutations by 

Ur{q,co{p),M) = |s:c7ii < M,^^ |a,jf < Co{p), for allij, 

for < (7 < 1. Thus, if q = 0, 

(0, CO (p) , M) = I S : an < M, J2 1 i^v / 0) < cq (p) | , 

a class of sparse matrices. We will mainly write cq for cq{p) in the future. 
Note that 



Amax(5]) < max^ \crij\ < ^co(p), 
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by the bound (2). Thus, if we define, 

Ur{q, co{p),M, eo) = {S : E G Ur{q, co(p), M) and X,nm{^) >eo> 0}, 

we have a class analogous to (5). 

Naturally, there is a class of covariance matrices that satisfies both band- 
ing and thresholding conditions. Define a subclass of U{eo,a,C) by 

V{eo,a,C) = {S: < C7|i - for all > 1, 

and < eo < Xmm{^) < Amax(S) < 1/eo}- 

for a > 0. Evidently, 

V{e,a,C) cU{eo,a,Ci) 

for Ci<C{l + l/a). 

On the other hand, S G V{eo,a, C) implies 

Y' " - ° (a + l)(?-l' 
so that for a suitable choice of cq and M, 

V(eo,a,C) cZ^,(g,co,M) 

for g> 
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2.2. Main result. Suppose we observe n i.i.d. p-dimensional observations 
Xi , . . . , X„ distributed according to a distribution F, with i?X = (with- 
out loss of generahty), and i?(XX^) = S. We define the empirical (sample) 
covariance matrix by 

1 " 

(6) E = -5](Xfc-X)(X,-Xf, 

where X = J2k=i -^k, and write S = [^ij]. 

We have the following result which parallels the banding result (Theorem 
1) of Bickel and Levina [3]. 

Theorem 1. Suppose F is Gaussian. Then, uniformly onUT-{q,CQ{p),M), 
for sufficiently large M' , if 



(7) t. = M'/^°^^ 



n 



and ^ = o(l), then 

IIT. Op (c„,p) (If;) "-"'^) 

and uniformly on Ur{q, Co{p),M, Eq), 

\\{Tt„it)r'-j:~'\\=Op(co{p) 



\ogp\ (l-9)/2 



n 

Proof. Recall that, without loss of generality, we assumed i?X = 0. 
Begin with the decomposition, 

(8) S = S° - XX"^, 

where 

1 " 



Note that, by (8), 

(9) max — aij \ < max |(T^- — cjjj | + max \XiXj \ . 

id id id 

By a result of Saulis and Statulevicius [32] adapted for this context in 
Lemma 3 of [3], and an < M for all i, 



(10) P 



max |(5"j°- — (Tijl >t 

id 



< P^Cie 
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for \t\ < 5, where Ci, C2 and 5 are constants depending only on M. In 
particular, (10) holds if i = o(l). 

For the second term in (9), we have, by the union sum inequality, the 
Gaussian tail inequality and an < M for all i, 



(11) P 



max \Xj P > t 



Combining (10) and (11), we see that if ^ Q and t = tn is given by 
(7), then for M' sufficiently large. 



(12) inax\aij - aij\=Op(\h^^ 

i,j \ y 71 

We now recap an argument of Donoho and Johnstone [8] . Bound 
\\Tt{±) - S|| < \\Tti^) - S|| + \\Tt{±)-Tt{m- 
The first term above is bounded by 

p 

(13) max V \a^j\l{\aij\ <t)< t^~''co{p). 
On the other hand, 

\\Tt{t)-Tt{m 

p 

< max^ |(7ij|l(|(Tjj| > t, \aij\ < t) 
" i=i 

p 

(14) +max V |cJij|l(|(Tij| < t, \aij\ > t) 

P 

+ max^ I 

" i = l 
= I + II + III. 

Using (12), we have 



p 



III < nia,x \(Tij — aij\ max^ \o'ij\'^t = Op ( co{p)t 



'logp 



To bound term I, write 

p p 
I < max^ l&ij — aij\l{\aij\ > t, \aij\ <t) + max^ |(Tjj|l(|(Tij| < t) 

(15) 

< IV + V. 
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By (13), 

(16) V<ti^'^co(p). 

Now take 7G (0,1). Then, 

p 

IV < max ^ I |l(jo-ij| > t, \(7ij\ < -ft) 



(17) 



p 

+ max V l&ij - aij\l{\aij\ > t,-ft < \aij\ < t) 
< max l&ij — (Tij \ max A^j(l — 7) + co(j')(7t)~'^ max \aij — 



where Ni{a) = YTj=i ^il'^ij ~ > ^.t). Note that, for some 5 > 0, 



(18) 



maxA'^j(l - 7) > 



P 



max l&ij — (Tjj I > (1 — 7)t 



if t = 0(1), uniformly on U. By (18) and (16), and < 7 < 1, if 

(19) 2logp - n5t^ ^ -00, 

then 



(20) W = Op(co{p)t~i^ 

and, by (9) and (13), 

(21) 



'logp 



n 



l = Op(co{p)t-J'^^ + co{p)& 



For term II, we have 



II < max^[|(7ij - aij\ + |(Tjj|]l(|(5-ij| < t, \aij\ > t) 



(22) < max \aij — aij \ ^ Id^jj \^t) + t max ^ l(|'7jj | > t) 

i=i * i=i 



Op co(p)t- 



'logp 



n 



+ co{p)t 



Combining (21) and (22) and choosing t as in (7) establishes the first claim 
of the theorem. The second claim follows since 

||[r,„(s)ri-s-i|| = Op(||r,„(s)-s||) 
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uniformly on Ur{q,co{p),M,eo), where A = Qp{B) means A = Op{B) and 
B = Op{A). □ 

Theorem 2. Suppose F is Gaussian. Then, uniformly onUr{q,co{p),M), 



if t = M'y and M' is sufficiently large, 

(23) i||T.(E)-E||| = Op(c„(p)(l^fi)""). 

An analogous result holds for the inverse on Ur{q, co{p),M, eq). The proof 
of Theorem 2 is similar to the proof of Theorem 1 and can be found in the 
Appendix. 

2.3. The non-Gaussian case. We consider two cases here. If, for some 
r/>0, 

Ee'^^i < K < oo for all |t| < r]j, for all i,j, 



then the proof goes through verbatim, since result (10) still holds. The bound 



on maxj will always be at least the squared rate of maxjj \aij — cjjJ, 



hence we do not need normality for (11). 

In the second case, if we have, for some 7 > 0, 

E\Xij\'^^^+'^^ < K for all 

then by Markov's inequality 

j.-(l+7)/2 

(24) P[\aij - a.,\ >t]< KCi^)—^^. 

Thus the bound (10) becomes 



p'KCij)- 



and hence, 

.^2/(1+7) \ 



Therefore, taking t„ = ^1/2^^ ' ^^'^ that 

/ /r,2/(l+7)\ l-g 

(25) ||r,JS)-S||= Op (^co(p)(^i^^^j 

which is, we expect, minimax though this needs to be checked. 
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Op 



2.4. Comparison to thresholding results of El Karoui. El Karoui [10] 
shows as a special case that if: 

(i) E\Xj\^ < oo for all r, l<j<p, 

(ii) ajj < M < oo for all j , 

(iii) if aij / 0, \aij \ > Cn~"o, <ao < ^ - Sq < ^, 

(iv) S is /3-sparse, (3 = ^ — rj, > 0, 

(v) £^cG(0,oo), 

then, if t„ = Cn~°', a = ^ — 5(i> ao, 

(26) ||rt„(S)-S||^0. 

El Karoui's notion of /3-sparsity is such that our case g = is /^-sparse with 
cq{p) = Kp^ . Our results yield a rate of 

for 7 arbitrarily large. Since /? < ^ by assumption and p>i n we see that 
our result implies (26) under (i), (ii), (iv), (v) and our notion of sparsity. 
Thus, our result is stronger than his in the all moments case, again under 
our stronger notion of sparsity. El Karoui's full result, in fact, couples a 
maximal value of r in (i) with the largest possible value of (3. Unfortunately, 
this coupling involves (iii) which we do not require. Nevertheless, his result 
implies the corresponding consistency results of ours, if (iii) is ignored, when 
only existence of a finite set of moments is assumed. However, according to 
El Karoui (personal communication), (iii) is not needed for (26) in the case 
when our sparsity condition holds. 

2.5. Comparison to banding results of Bickel and Levina. Comparison is 
readily possible on V{eo,a,C). By Theorem 1 of [3] the best rate achievable 
using banding is 

On the other hand, by our Theorem 1, thresholding yields 

for q > ^jqry. Comparing exponents, we see that banding is slightly better in 
the situation where labels are meaningful, since we must have 

a 



l-q< 



a + l 

However, since 1 — q can be arbitrarily close to the difference is not 
sharp. Not surprisingly, as a ^ oo, the genuinely sparse case, the bounds 
both approach C-^)^/^. 
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3. Choice of threshold. The question of threshold selection seems to be 
hard to answer analytically. In fact, the aij have variances which depend on 
the distribution of (Xj , Xj ) through higher-order moments so it may in fact 
make sense to threshold differentially. We conjecture that this would not 
make much difference if we assume second and fourth moments bounded 
above and below. Ignoring this issue, we propose a cross-validation method 
analogous to the one used by Bickel and Levina [3] but made using the 
Frobenius metric which enables us to partly analyze it. 

3.1. Method. Split the sample randomly into two pieces of size ni and ?i2 
where a choice to be "justified" theoretically is ni = n(l — jj^), n2 = 

and repeat this N times. Let Si^jy, T,2^u be the empirical covariance matrices 
based on the ni and n2 observations, respectively, from the z/th split. Form 

N 



(27) 



1 



2 

2,1^11^ 



and choose s to minimize /^(s) (in practice for s ^ s^i 
will show that, under the conditions of Theorem 2, 

l-q/2 



0, En ^ ^) 



We 



(28) 



1 

P 



Op 



logp 



n 



co(p) 



uniformly on Ur{q, C(){p),M) for q> 0. Claim (28) is weaker than the desired 

'logp\ 



(29) 



1(2,2) 



Op 



n 



co{p) 



in terms of the norm, though the left-hand side of (28), the average of a set 
of eigenvalues, can be viewed as a reasonable proxy for the operator norm, 
the maximum of the same set of eigenvalues. 

We begin with two essential technical results of independent interest. 

3.2. An inequality. We note an inequality derivable from a classic one of 
Pinelis — see [33], for instance. 

Proposition 1. LeiUi,...,U„ be i.i.d. p-variate vectors with E\lJi\'^ < 
K , E\Ji = 0. Let vi, . . . , vj he fixed p-variate vectors of length 1. Define for 



Then, 
(30) 



E 



1=1 



max v.- X 

i<j<J ^ 



<CnlogJ^||Ui||2, 



where C is an absolute constant. 
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Proof. By symmetrization, 

2 



E 



i=l 



< 2Emip^[J2£i\(Ui - V'ifvj 



where \J[ are i.i.d. as Uj and independent of Uj, and {si} are ±1 with 
probabihty 1/2 and independent of |(Uj — U^)"^Vj|. Let 



(u.-u'jS-I, 



w,. 



Then, 



\i=l 



<E<E 



max 

i<j<J 



E« 



{Wij : 1 < i < n, 1 < J < J} 



max y^W, 



< Cn log Ji? max W. 



by Pinehs' inequality [33]. Thus 



E max ( V EiWij ] < Cn log JE max ((Ui - U-)^v, ) 

l<j'<J' — ^ ' 7 



^1=1 



l<j<J 

< 2Cn\ogJE max (UTv,^ 



□ 



3.3. A general result on V-fold cross-validation. We will prove our result 
for = 1 in (27). The nature of our argument in Theorem 3 is such that it 
is fairly easy to see that it applies to each term of the sum in (27) and thus 
holds not just for the "sample splitting" (N = 1) procedure, but also for the 
general 2-fold cross-validation procedure that is given by (27), and in fact 
for more general l^-fold cross-validation procedures. 

Let Wi, . . . , W„_|_B be i.i.d. Q-variate vectors with distribution P, with 
EpW = fJ-{P). Let fij, 1 < j < J be estimates of /x based on Wi, . . . , W„. 

For convenience, in this section we write |xp = ||x||2 = J2f=i x'j and (x, y) = 
x^y. Let 

L(/x, d) = 1/1 — dp. 

The oracle estimate fi° is defined by 

/i° = argmin |/i(-P) — fijl'^- 

j 
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The sample splitting estimate /i'^ is defined as follows. Let 

B 



Then, 



B . ^ 



fi^ = arg min | — fij \ 

j 



Here is our basic result which has in some form appeared in Gyorfi et al. 
([19], Chapter 7, Theorem 7.1, page 101), Bickel, Ritov and Zakai [4] and 
Dudoit and van der Laan [9]. The major public proof in [19] appears to be 
in error and does not directly apply to our case so we give the proof of our 
statement for completeness. 

Theorem 3. Suppose: 
(Al) |A°-MP)|' = J^p(r„); 

(A2) £^pmaxi<j<j |(vj, Wi — /i)|^ ^Cp{J) for any set vi,...,vj of unit 

vectors in MP ; 
(A3) p(J„)l2|i = o(r„). 

Then, 

(31) lA^ - KP)\^ = IA° - /^(^)l'(l + op{l)) = ^p{rn), 
where A = Qp{B) means that A = Op{B) and B = Op{A). 

We identify suitable Jn and Bn in our discussion of Theorem 4. 

Proof of Theorem 3. By definition, writing fj, = fi{P), 

(32) |A'-Wb|2<|A''-Wb|2, 
which is equivalent to 

(33) 2(A" - A°, Wb - /i) > I A' - /^l' - I A" - Ml'- 
But, 

(34) I ( A^ - A°, Wb - /i) I < I ( A' - Wb - /i) I + 1 ( A° - Wb - /x) I . 
Now, let 

Then we have 

(35) KA'^ - Wb - m)I < lA' - Ml max |(i>„ - /i)|. 
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and similarly for the other term. Now, by Proposition 1 and assumption 
(A2), 

(36) i^^max Wb - < C^^pi-J), 

where C is used generically. Therefore, after some algebra and Cauchy- 
Schwarz, by (32), 

|2 < nr.f^^^^—Ln^-^f.Adn'' _ „.| ^ _ „.h + \r.° - ,i.|2 



(37) I - /x|^ < Op J) j (l/ic _ ^1 + _ + _ ^|._ 
Letting — /ip = o„, we can rewrite (33) as 

(38) a„ < j)(«i/2 + ^1/2) + 

with probability 1 — e(C), with e(C) — > as C ^ oo. Using (iii), 
an<ay'op(ry2)+r„(l + op(l)). 

But by definition, 
and hence, 

ay^<op(ry^) + ry^(l + op(l)) 
and the theorem follows. □ 

We proceed to show the relevance of Theorem 3 in our context. As we 
indicated, it is enough to consider = 1, and for convenience write the 
observations as 

Xi , . . . , Xm , . . . , Xm+_B , 

where n = m + B. Form Si and the sample covariances of Xi, . . . , X.^ 
and Xm+i, . . . , Xm+B, respectively, and the estimates To(S), Ts(S) corre- 
sponding to the oracle and statistician. By Theorem 2, it is clear that for 
all T.eUr, 

(39) ||T„(E)-S||2, = 0p(^co(p)p(i^y 

and the same holds for To(S^), the oracle estimate applied to the covariance 
matrix computed with known means. 

Let the optimizing s and the oracle s be, in fact, obtained by searching 

over a grid {jJ^:G< j < Jn}- For selected S G Ur, Op in (39) can be 



COVARIANCE THRESHOLDING 



15 



turned into 0,p. To see this, consider, for example, an = 1, cjjj = e|i — 

for j. For e > sufficiently small, this matrix is positive definite and we 

can take the right-hand side of (39) to be r„ = Kco{p)p{^^^)^~'='^'^ for some 

K. For t = M'^f^ and M' sufficiently large, we know that (42) holds as 
an identity, which yields a contribution of at least r„. The remaining terms 
in the risk for Tt(S) can only increase it. The same holds for T,^. For S 
such as this with inft ||Tt(S) — = inff |[Tf(S'') — = Vn, we can state 
the following theorem. For simplicity of notation, in what follows we assume 
coip) = Co < oo. The general case follows by simply rescaling S by co{p). 

Theorem 4. Suppose Xj are Gaussian, T, (^Ur{q,CQ{p), M), and Op = 
rip in (39). Then, if Bn = ne{n,p), (log J)^ = o(n''/^co(p)(logp)^^''/^e(n,p)), 
then 

(40) ||r,(E) - Slli. = \\To{±) - S||i.(l + op(l)). 

Thus, snp{\\Ts{t)-J:\\j,:J:eUr{q,co{p),M)}=Kco{p)pC-^y~''/\ which 
is the optimal rate for the oracle as well. 

The proof of Theorem 4, which consists of several lemmas that allow us 
to apply the general Theorem 3, is given in the Appendix. 

Notes. 1. Evidently, with e{n,p) ~ (logn)"-*^ we can take J ^n'^, for 
any k < oo if q > 0, and if p ~ , even if q = 0. 

2. Similar results can be obtained for banding. 

3. The assumption of Gaussianity can be relaxed. By applying Corollary 
4.10 from Ledoux [27], we can include distributions F of X = Ae, where 
e = (ei, . . . ,ep)"^ and the ej are i.i.d. \£j \ < c < oo (thanks to N. El Karoui 
for pointing this out). 

4. Simulation results. The simulation results we present focus on com- 
paring banding, thresholding, and two more permutation-invariant estima- 
tors: the sample covariance and the shrinkage estimator of Ledoit and Wolf 
[26]. We consider the AR(1) population covariance model, 

(41) S=[a.,] = [/-^-|] 

with p = 0.7. The value of 0.7 was chosen so that the matrix is not very 
sparse (as would be the case with p < 0.5) but does have a fair number of very 
small entries (which would not be the case with p close to 1). For banding, 
we show results for the variables in their "correct" order, and permuted at 
random. All other estimators are invariant to variable permutations, so their 
results are the same for both of these scenarios. We consider three values of 
p = 30, 100, 200 and the sample size is fixed at n = 100. 
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Table 1 

Averages and standard errors over 100 replications of performance measures for AR(1) 

with p = 0.7 



p 


Sample 


Ledoit-Wolf 


Banding 


Banding perm. 


Thresholding 








Matrix 1-norm 






30 


3.87(0.07) 


3.36(0.05) 


2.54(0.05) 


3.85(0.07) 


3.28(0.05) 


100 


11.46(0.09) 


7.99(0.05) 


3.13(0.04) 


5.05(0.01) 


4.61(0.04) 


200 


22.00(0.14) 


11.82(0.06) 


3.34(0.03) 


5.09(0.01) 


4.99(0.01) 








Operator norm 






30 


1.95(0.04) 


1.69(0.03) 


1.38(0.03) 


1.92(0.04) 


1.90(0.04) 


100 


4.16(0.05) 


3.06(0.02) 


1.68(0.02) 


4.63(0.003) 


3.15(0.03) 


200 


6.68(0.06) 


3.80(0.01) 


1.80(0.02) 


4.67(0.002) 


3.64(0.02) 






Frobenius norm 






30 


3.19(0.04) 


2.89(0.03) 


2.42(0.03) 


3.21(0.04) 


3.42(0.03) 


100 


10.23(0.04) 


8.16(0.02) 


4.60(0.02) 


13.80(0.001) 


8.73(0.03) 


200 


20.24(0.05) 


14.02(0.02) 


6.61(0.03) 


19.61(0.001) 


13.79(0.03) 




Abs. difference between true and estimated largest eigenvalue 




30 


0.91(0.06) 


0.46(0.04) 


0.52(0.04) 


0.84(0.06) 


0.74(0.05) 


100 


2.86(0.06) 


0.43(0.03) 


0.38(0.03) 


4.24(0.01) 


1.07(0.05) 


200 


5.21(0.07) 


0.42(0.03) 


0.31(0.02) 


4.23(0.01) 


1.15(0.04) 




Abs. cosine of the ang' 


le between true and estimated 1st PC 




30 


0.77(0.03) 


0.77(0.03) 


0.81(0.01) 


0.76(0.03) 


0.70(0.02) 


100 


0.37(0.02) 


0.37(0.02) 


0.42(0.02) 


0.10(0.004) 


0.28(0.01) 


200 


0.27(0.02) 


0.27(0.02) 


0.26(0.01) 


0.06(0.003) 


0.18(0.01) 



Table 1 shows average losses and standard deviations over 100 replica- 
tions, as measured by three different matrix norms (matrix 1-norm which we 
denote || • ||(i,i), operator and Probenius norms). We also report the absolute 
difference in the largest eigenvalue, |Amax(5^^) — Amax(S)|, and the absolute 
value of the cosine of the angle between the estimated and true eigenvectors 
corresponding to the first eigenvalue. This assesses how accurate each of the 
estimators would be in estimating the first principal component. 



Table 2 

Averages and standard errors over 100 replications of selected band width and threshold 



p 


Banding k 


Banding perm, k 


Threshold t 


30 


4.36(0.07) 


24.57(0.14) 


0.33(0.004) 


100 


4.27(0.05) 


0.00(0) 


0.49(0.002) 


200 


4.22(0.04) 


0.00(0) 


0.55(0.001) 
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The results in Table 1 show what one would expect: when banding is given 
the correct order of variables, it performs better than thresholding, since it 
is taking advantage of the underlying structure. When banding is given the 
variables in the wrong order, it performs poorly, often worse than the sam- 
ple covariance matrix, and then thresholding is a much better choice. The 
Ledoit-Wolf estimator performs worse than thresholding by most measures, 
although it does well on estimating the largest eigenvalue. Note that the 
eigenvectors of the Ledoit-Wolf estimator are equal to the sample covari- 
ance eigenvectors. 

Table 2 shows the band width selected by the cross-validation procedure 
on correct and permuted orderings, and the threshold selection. Note that 
banding in permuted order always selects a diagonal model for both p = 100 
and p = 200, and keeps almost all the entries at p = 30, both of which result 
in bad estimators. The selected threshold increases with dimension, which is 
expected since in higher dimensions one would need to regularize more. The 
selected band width also goes down with dimension (the decrease on our 
range of p's is not very large, but results over a wider range of dimensions 
in [3] show the same pattern more clearly). 

Figure 1 shows scree plots of the true eigenvalues and means, 2.5% and 
97.5% percentiles of the estimates over 100 replications for p = 100. Interest- 
ingly, the results show that the sample covariance is very bad at estimating 
the leading eigenvalues, but better than thresholding on the middle part 
of the spectrum. The leading eigenvalues, however, are more important in 
applications like PCA. The Ledoit-Wolf estimator does better on eigenval- 
ues than on overall loss measures in Table 1. Banding in the correct order 
appears to do best on estimating the spectrum. For illustration purposes, 
scree plots from a single randomly selected realization are shown in Figure 
2. 

5. Climate data example. In this section, we illustrate the performance 
of the thresholded covariance estimator by applying it to climate data. The 
data are monthly mean temperatures recorded from January 1850 to June 
2006; only the January data were used in the analysis below (157 observa- 
tions). The region covered by the total of 2592 recording stations extends 
from —177.5 to 177.5 degrees longitude, and from —87.5 to 87.5 latitude. 
Not all the stations were in place for the entire period of 157 years; we do 
not impute the missing data in any way, but instead simply calculate spatial 
covariance from all the years available for any given pair of stations. 

EOFs (empirical orthogonal functions) are frequently used in spatio-temporal 
statistics to represent patterns in spatial data. They are simply the princi- 
pal components of the spatial covariance matrix, where observations over 
time are used as replications to calculate covariance between different spa- 
tial locations. EOFs are typically represented by spatial contour plots, which 



18 



P. J. BICKEL AND E. LEVINA 




Fig. 1. Scree plots: the mean estimated eigenvalues, their 2.5% and 97.5% percentiles, 
and the truth. 
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Index 



Fig. 2. Scree plot of single realization. 



provide a visual illustration of which regions contribute the most to which 
principal components. 

The plots in Figures 3 and 4 show the contour plots of the first four EOFs 
obtained, respectively, from the spatial sample covariance matrix (regular 
PCA), and from the thresholded spatial covariance matrix. We see that with 
thresholding, the first EOF essentially corresponds to Eurasia, and the sec- 
ond to North America, which the climate scientists agree should be separate. 
The regular PCA does not separate the continents. Ideal separation would 
be achieved if the estimator was block-diagonal (no nonzero correlations be- 
tween North America and Eurasia) . The thresholding estimator is in fact not 
block-diagonal, but does set enough correlations to zero to achieve effective 
separation of two continents in the EOFs. We also note that in this case the 
thresholding estimator does have some small negative eigenvalues, but they 
correspond to a negligible fraction of variance. 

6. Summary and discussion. We have proposed and analyzed a regular- 
ization by thresholding approach to estimation of large covariance matrices. 
One of its biggest advantages is its simplicity — hard thresholding carries no 
computational burden, unlike many other methods for covariance regulariza- 
tion. A potential disadvantage is the loss of positive definiteness — but since 
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EOF #i (14.S2% Variance) 




Longitude 
EOF #2 {8.68% Variance) 




Lcjugitude 



Fig. 3. First two EOFs for the January temperature data obtained from regular PC A. 

we show that for a suitably sparse class of matrices the estimator is con- 
sistent as long as (logp)/n — > 0, the estimator will be positive definite with 
probability tending to 1. We show consistency in the operator norm, which 
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EOF #1 (8.49% Variance) 




Limgitudi 



EOF #2 (7.82% Variance) 
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-150 -100 -50 50 100 150 

LouRituiie 



Fig. 4. First two EOFs for the January temperature data obtained from PCA on the 
thresholded covariance matrix. 
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guarantees consistency for principal components, hence we expect that PCA 
will be one of the most important applications of the method. 

We have also provided theoretical justification for the cross-validation 
approach to selecting the threshold. While it was formulated in the context 
of hard thresholding, the general result is much more widely applicable; in 
particular, it applies to other covariance estimation methods that depend 
on selecting the tuning parameter, such as [3, 21, 28, 31] and others. 

APPENDIX: ADDITIONAL PROOFS 

Proof of Theorem 2. The proof is essentially the same as for Theo- 
rem 1. We need to bound 

^{aabl{\aab\ >t)- CTabf- 
a,b 

As before, 

(42) Y.al,l{\aab\<t)<t^-'^pco{p). 

a,b 

Similarly, for instance, the analogue of term III is 

Y.i^'-b - CTabfm^abl > t, Kbl >t)< t-Vo(p)^(l + Op(l)) 
a.b 

(43) 

^o,(.,,,(!^)'-). 

The only new type of term arises in the analogue of term I. Note that 

^^IbHl^abl > t, \crab\ < t) 
a,b 

<PCo{p)t^~'^ + Y.\^ab-(^lb\m^ab\>t,\aab\<t). 
a,b 

The second term is bounded by 

^a,b\^lb - (^IblKt < \^ab\ < (1 + kafel < t) 
2 2 

+ T.a,bWab - (^ab\KWab\ > (1 +e)i, Wab\ < t). 

The second term in (44) is with probability tending to 1. The first is 
bounded by 

^a,b\^lb - 0"a6|l(* < \^ab\ < (1 + e)*, Wab\ < 7l*) 

(45) 

+ ^a,bWab-(^ab\Ki< Wab\ < (l+e)t,7li < Wab\ <t), 
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where 71 G (0, 1). Again, the first term in (45) is with probabihty tending 
to 1 and the second is bounded by 

{l + e)t'co{p)p^^H-'^ 

since 

max \ali, - ali,\ < max l&ab + (^abl max l&ab - (Tab\- 

a,b a,b a,b 

The theorem foUows by putting (42), (43) and the other remainder terms 
together. It is clear from the argument that, by restricting the result from 
the class Ur^q, co(p), M) to properly chosen S's, we can change Op into Qp. 
□ 

We need the following lemmas to apply Theorem 3 to the special case of 
Theorem 4. Let 

W,^{[Xi''>xi%l<a,b<p} 
where = {X^,. . . ,X^'^)^, so that Q=p^. 

Lemma A.l. Suppose T, G Ur{q,co, M), V = [vab] is symmetric p x p, 
= 1 and F is Gaussian. Then, 

Var(5]7;,feXWx(^M < pCi{q,Co, M), 

\ a,b I 

Proof. By Wick's theorem, E{XaXf,XcXd) = (Jabbed + CFac^bd + cradf^bc- 
Then for any V = [vab] as above, 

(46) ElY.'^-biXi^^X^^^ -aab)] = J2 ^abVcdiaa 
\ a,b / a,b,c,d 



acCTbd + O'adO'bc)- 

The two terms are equal because Vat = v^a- Consider the first term. Write 

E '"abVcdO'acO'bd 



a,b,c,d 



E ^E '^acVcdj ^E '^bdVai}j 



/ sl/2/ ^l/2/ ^l/2/ ^ 1/2 

<E(E-Lj (E-?.) (E-l] (llvlb) 

l/2\ 2 
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where we used 



Similarly, 



J2 C^lc < max kaeP"^ ^ kacr < M^^'^Cq. 



\a,b ) 

\a,b,c,d,j,k,l,m / 

1 1 

= ^ E '"abVcdic^acCTbd + (^adCTbc) + ^ E 



bVcd<y ab<y cd- 



a,b,c,d 



The first term is the same as (46). The second term is 



\a,b / \ a \ h 



1/2 



l/2\ 2 



and Lemma A.l follows. □ 

Lemma A. 2. SupposeT, €Ur{q,co,M), V is symmetric pxp, 11^11^7 = 1, 
Til = [XaX'^] and F is Gaussian. Let S = PAP^ be the eigendecomposition 
of T,, and |7i| > I72I > ■ • ■ be the eigenvalues of 



5 = Al/2pTyp^l/2_ 



Then, 
(47) 



P 



-1/2 



tr(yS?)-^7, 



fort ^00, K = K{q,co,M) >0, 5 = 5{q,co,M) > 0. 

Proof. Let Z ~ A/'p(0,/pxp) and write 
(48) tr(yS?) = tr(yPAV2zZ^AV2pr) = z^5Z ~ 7jZ2. 

By Lemma A.l, 



(49) 



J2j] = 1 Var(tr(yS?)) < pC{q, cq, M). 
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Let 7j = 7jp ^^'^ . In view of (48), to prove the right tail bound in (47) it is 
enough to show 



P 



■i=i 



for t ^ oo if 
(50) 



E7|<C(^,co,M). 
i=i 



Suppose first that ah the 7j are positive and 71 > 72- By the general form 
of Markov's inequality, 



(51) P 



^7,(^1 -l)>t 
■i=i 



< inf jexp (-St - ^ 7^5 J ]^ Ee''^= 
I V j=i / j=i 



^1 



The log of the function on the right-hand side of (51) is 



(52) 



^* + E[-7j^-llog(l-27is)]. 



The minimizer satisfies 

* = E[7j-(1 - 27.-^)"' - 7.-] = 2s 7^(1 - 27,s)-' 

(53) 



257^(1 -27is)-M 1 + 5] 

V j=2 



71 



(l-27is)(l-27,-s)^i 



(54) 



Write ijj = \ — 271S. Substituting into (53) and expanding: 
71(1 -cj) 

\ i=2 



V / ~ \ -1 

' 7i' 



ijj 



Now write if = + A/t, where = 7i(1 — 1^0)) and use = (71/O + 
0(t~^). Substituting into (54) and solving for A, we obtain after some com- 
putation. 



i=2 



71 



Note that tA is bounded by (50). Substituting back into (52) we finally 
obtain the bound 



,(-t/(27l))(l+o(l)) 
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If 7i has multiplicity m > 1, we can argue as above after pulling out 
(1 — 271)"'"/^ first. If the multiplicity of 71 is 1 but 71 — jj — > 0, the number 
of such terms is bounded above by (50) unless 71 itself tends to and we 
can argue as if all such 7^ are equal. Note also, that by (50) m^f is bounded 
unless 71 — > 0. But in that case we can obtain a better bound than (47) by 
applying Theorem 3.1 of Saulis and Statulevicius [32]. 

Finally, if not all the 7j are positive, we break up 

j=l j=l j=l 

and compute the bound for each summand. A similar but easier argument 
and a better bound hold for P\J2^=ilj{^j ~ 1) ^ ~t]- The lemma follows. 

□ 

Lemma A. 3. Under the conditions of Lemma A. 2 for E and each Vj, 
(55) p{J)<C{q,Co,M){logjfp. 



Proof. By Lemma A. 2, 



P 



(56) 



p max 

i<j<J 



k=l 



>t 



< 1(0 < t < x) + JKe-^^l{t > x) 
by applying the union sum inequality for t>x^oo. Integrating, we get 



-^E max (ii{Vjt\) - ^ 7^-^^ <x^ + JK H e~ 



Vt.5 



dt 



x^ + JK I " ve-^^dv 



P 



= x2 + Ji^(xe-^^ + rie-^^) 
Minimizing over x, we get that as J — > co, the minimizer satisfies 

x = AlogJ(l + o(l)) 

for A{C,K)< 00. Then, 

p~'Eme,x^(^triV,t'i) - 7,fc) < C(log J)^ 



and Lemma A. 3 follows. □ 
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A similar argument shows that under the conditions of Lemmas A.l and 
A.2 

\B-^C^{q,CQ,M)p{\ogjf. 

Proof of Theorem 4. We use Lemma A. 3 to bound p{J) and first 
obtain the equivalent of (40) for by plugging in p(J) < C(log J)^p. Take 
r„ = J as in the statement of Theorem 4, and check that 

the conditions of Theorem 3 are satisfied when applied to {Wj} and the J 
thresholding estimates that we optimize over for estimating £^(Wi) = S. 

The argument for S requires us to return to (35) with — replaced 
by Wb — XbX]J. By (57), (36) holds with this replacement on the left-hand 
side, and p{J) < C(logJ)^pB~^ . The analogue of Theorem 3 holds for this 
special case and Theorem 4 follows. □ 
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